Abstract: For acoustic modeling of low-resource languages, a perplexity-based approach is proposed to select unsupervised data from decoding transcriptions and retrain the acoustic model. A large unsupervised corpus is first decoded with an initial acoustic model trained on a small amount of labeled data, and the perplexity between each decoded transcript and the training set is computed. The selected data, i.e., those most similar to the labeled data, are then used together with the labeled data to retrain the acoustic model. To reduce the impact of errors in the decoded unsupervised data, the final network parameters of the deep neural network acoustic model are adjusted using only the correctly labeled data in the last training iteration. On the Swahili VLLP recognition task of the NIST 2015 Open Keyword Search evaluation, the proposed approach improves the recognition rate compared with other methods.
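The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a toy add-one-smoothed unigram language model trained on the labeled transcripts, whereas in practice an n-gram LM toolkit (e.g. SRILM) and a tuned perplexity threshold would be used. All sentences, function names, and the threshold value are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Train a toy add-one-smoothed unigram LM on labeled transcripts."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return counts, total, vocab

def perplexity(sentence, counts, total, vocab):
    """Per-word perplexity of a decoded transcript under the unigram LM."""
    words = sentence.split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / max(len(words), 1))

# Hypothetical labeled training transcripts (the small supervised set).
labeled = ["habari ya asubuhi", "asante sana rafiki"]
counts, total, vocab = train_unigram(labeled)

# Decoded transcripts of the unsupervised corpus; keep only those whose
# perplexity against the labeled data falls below a threshold, so that
# the selected data resemble the labeled data.
decoded = ["habari rafiki", "xyz qqq zzz"]
threshold = 10.0  # hypothetical; would be tuned on held-out data
selected = [s for s in decoded
            if perplexity(s, counts, total, vocab) < threshold]
# selected == ["habari rafiki"]
```

The selected utterances (with their decoded transcripts as labels) would then be pooled with the labeled data to retrain the acoustic model, with the final iteration restricted to the labeled data as the abstract describes.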